Introduction

This step of the BDC workflow extracts the collection year, whenever possible, from complete and legitimate date information, and flags collecting years that are dubious (e.g., 07/07/10), illegitimate (e.g., 1300, 2100), or not supplied (e.g., 0 or NA).


Important:

The results of the VALIDATION tests used to flag data quality are appended to the database in separate fields and returned as TRUE or FALSE: TRUE indicates correct records, and FALSE indicates potentially problematic or suspect records.

Installation

You can install the released version of BDC from GitHub with:

if (!require("remotes")) install.packages("remotes")
if (!require("bdc")) remotes::install_github("brunobrr/bdc")

Creating folders to save the results.

Read the database

Read the database created in the Space step of the BDC workflow. It is also possible to read any dataset containing the **required** fields to run the workflow (more details here).

database <-
  qs::qread("Output/Intermediate/03_space_database.qs")

Standardization of character encoding.

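# Mark every character column as UTF-8.
# [[i]] extracts the column as a vector, so this also works on tibbles.
for (i in seq_len(ncol(database))) {
  if (is.character(database[[i]])) {
    Encoding(database[[i]]) <- "UTF-8"
  }
}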


1 - Records lacking event date information

VALIDATION. This function flags records lacking event date information (e.g., empty or NA).

check_time <-
  bdc_eventDate_empty(data = database, eventDate = "verbatimEventDate")
#> 
#> bdc_eventDate_empty:
#> Flagged 3179 records.
#> One column was added to the database.
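
To make the logic of this test concrete, here is a minimal base R sketch of the same idea on hypothetical toy data. It is not the implementation used by bdc_eventDate_empty; in particular, treating whitespace-only strings as empty is an assumption of this sketch.

```r
# Hypothetical toy data, not part of the BDC example database
toy <- data.frame(
  verbatimEventDate = c("2010-05-12", "", NA, "  ", "1998"),
  stringsAsFactors = FALSE
)

# TRUE = date information present; FALSE = NA, empty, or whitespace only
# (assumption: whitespace-only strings count as empty)
toy$.eventDate_empty <- !is.na(toy$verbatimEventDate) &
  nzchar(trimws(toy$verbatimEventDate))

toy$.eventDate_empty
```

As in the workflow, FALSE marks the suspect records, so the flag column can be used directly in a filter.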

2 - Extract year from event date

ENRICHMENT. This function extracts a four-digit year from unambiguously interpretable collecting dates.

check_time <-
  bdc_year_from_eventDate(data = check_time, eventDate = "verbatimEventDate")
#> 
#> bdc_year_from_eventDate:
#> Four-digit year were extracted from 2933 records.
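
The idea of accepting only unambiguous four-digit years can be sketched in base R as follows. This is an illustration under stated assumptions (a regular expression that matches a standalone run of exactly four digits, applied to hypothetical example dates); the parsing rules used by bdc_year_from_eventDate may differ.

```r
# Hypothetical example dates
dates <- c("2010-05-12", "12/05/1998", "07/07/10", NA, "1985")

# Locate the first run of exactly four digits that is not part of a longer number
m <- regexpr("(?<![0-9])[0-9]{4}(?![0-9])", dates, perl = TRUE)

# Start with NA everywhere and fill in the years that were found
year <- rep(NA_integer_, length(dates))
hit <- !is.na(m) & m > 0
year[hit] <- as.integer(regmatches(dates, m))

year
```

Dates with a two-digit year, such as "07/07/10", yield NA because the year cannot be recovered without guessing the century.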

3 - Records with out-of-range collecting year

VALIDATION. This function identifies records with an illegitimate or potentially imprecise collecting year. The year can be out of range (e.g., in the future), or the record may have been collected before a threshold year supplied by the user (e.g., 1900). Older records are more likely to be imprecise because their coordinates were often derived from locality descriptions.

check_time <-
  bdc_year_outOfRange(data = check_time,
                      eventDate = "year",
                      year_threshold = 1900)
#> 
#> bdc_year_outOfRange:
#> Flagged 12 records.
#> One column was added to the database.
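
The range check described above can be sketched in base R. The example years are hypothetical, and letting missing years pass (they are already handled by the empty-date test) is an assumption of this sketch rather than a statement about how bdc_year_outOfRange behaves.

```r
# Hypothetical years, including one missing value
year <- c(1300L, 1850L, 1995L, 2100L, NA)
year_threshold <- 1900L
current_year <- as.integer(format(Sys.Date(), "%Y"))

# TRUE = plausible year; FALSE = before the threshold or in the future
# (assumption: NA years are not flagged by this test)
in_range <- is.na(year) | (year >= year_threshold & year <= current_year)

in_range
```

Raising year_threshold makes the test stricter: more old records are flagged in exchange for a more reliable set of collecting dates.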

Report

Creating a column named .summary that sums up the results of all VALIDATION tests. This column is FALSE when a record is flagged as FALSE in any data quality test (i.e., a potentially invalid or suspect record).

check_time <- bdc_summary_col(data = check_time)
#> Column '.summary' already exist. It will be updated
#> 
#> bdc_summary_col:
#> Flagged 3481 records.
#> One column was added to the database.
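
The rule behind the summary column (a record passes only if it passes every test) can be illustrated with a small base R sketch. The data frame and the selection of flag columns by a leading dot are assumptions made for illustration; they are not taken from the bdc_summary_col source.

```r
# Hypothetical flag columns for three records (TRUE = passed the test)
flags <- data.frame(
  .eventDate_empty = c(TRUE, FALSE, TRUE),
  .year_outOfRange = c(TRUE, TRUE, FALSE),
  check.names = FALSE
)

# Assumption: flag columns are the ones whose names start with a dot
flag_cols <- grep("^\\.", names(flags), value = TRUE)

# Element-wise AND across all flag columns: FALSE if any test failed
flags$.summary <- Reduce(`&`, flags[flag_cols])

flags$.summary
```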



Creating a report summarizing the results of all tests of the BDC workflow.

# If you did not keep the columns containing the test results from the previous steps of the workflow, supply only "time" to the workflow_step argument.
report <-
  bdc_create_report(data = check_time,
                    database_id = "database_id",
                    workflow_step = c("prefilter", "taxonomy", "space", "time"))
#> 
#> bdc_create_report:
#> Check the report summarizing the results of the prefilter in:
#> Output/Report
#> 
#> bdc_create_report:
#> Check the report summarizing the results of the taxonomy in:
#> Output/Report
#> 
#> bdc_create_report:
#> Check the report summarizing the results of the space in:
#> Output/Report
#> 
#> bdc_create_report:
#> Check the report summarizing the results of the time in:
#> Output/Report

report


Figures

Creating a histogram showing the number of records collected over the years.

bdc_create_figures(data = check_time,
                   database_id = "database_id",
                   workflow_step = "time")
#> Check figures in C:/Users/Bruno Ribeiro/Documents/bdc/vignettes/Output/Figures


Number of records sampled over the years


Summary of all tests of the time step; note that some databases lack event date information


Summary of all validation tests of the BDC workflow


Save a “raw” database

Save the original database containing the results of all data quality tests appended in separate columns.

check_time %>%
  qs::qsave(.,
            here::here("Output", "Intermediate", "04_time_raw_database.qs"))

Filter the database

Let’s remove potentially erroneous or suspect records flagged by the data quality tests applied in all steps of the BDC workflow to get a “clean”, “fitness-for-use” database. Note that 29% of the original records (2,631 out of 9,000) were considered “fitness-for-use” after the data-cleaning process.

output <-
  check_time %>%
  dplyr::filter(.summary == TRUE) %>%
  bdc_filter_out_flags(data = ., col_to_remove = "all")
#> 
#> bdc_fiter_out_flags:
#> The following columns were removed from the database:
#> .uncer_terms, .val, .equ, .zer, .cap, .cen, .urb, .otl, .gbf, .inst, .dpl, .rou, .eventDate_empty, .year_outOfRange, .summary
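
As a quick check of the retention figure quoted for this step, the percentage can be recomputed from the two record counts given in the text. The variable names below are hypothetical; in a live session the counts could be taken with nrow() instead, assuming the unfiltered database still holds all original records.

```r
# Record counts as reported in the text above
n_clean <- 2631
n_total <- 9000

# Percentage of records retained after filtering, rounded to one decimal place
pct_retained <- round(100 * n_clean / n_total, 1)

pct_retained
```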

Save a “fitness-for-use” database

output %>%
  qs::qsave(.,
            here::here("Output", "Intermediate", "04_time_clean_database.qs"))